Abstract

Single-agent reinforcement learners in time-extended domains and
multi-agent systems share a common dilemma known as the credit
assignment problem. Multi-agent systems have the structural credit
assignment problem of determining the contribution of a particular
agent to a common task. Time-extended single-agent systems instead
have the temporal credit assignment problem of determining the
contribution of a particular action to the quality of the full
sequence of actions. Traditionally these two problems are considered
different and are handled in separate ways. In this article we show
how these two forms of the credit assignment problem are equivalent.
In this unified framework, a single-agent Markov decision process can
be broken down into a single-time-step multi-agent process.
Furthermore we show that Monte-carlo estimation or Q-learning
(depending on whether the values of resulting actions in the episode
are known at the time of learning) are equivalent to different agent
utility functions in a multi-agent system. This equivalence shows how
an often neglected issue in multi-agent systems is equivalent to a
well-known deficiency in multi-time-step learning and lays the basis
for solving time-extended multi-agent problems, where both credit
assignment problems are present.


1. Introduction 

The structural credit assignment problem of determining how a single
agent's actions contribute to a system that involves the actions of
many agents is inherent in multi-agent domains. For a reinforcement
learning agent to learn properly, this credit assignment problem needs
to be resolved and the agent needs to receive the appropriate
reinforcement. Robotic soccer is a well studied domain that clearly
exhibits this form of credit assignment problem, where learning
algorithms need to judge a particular player's role in achieving the
overall goal of the multi-agent system of winning the game [6]. This
structural credit assignment problem has been studied in other domains
including foraging robots [5], network routing [15] and bimatrix games
[3]. In these systems the credit assignment problem was handled
implicitly by creating a reward structure that credited an agent's
role in the performance of a larger system.

In a single-agent domain, the temporal credit assign- 
ment problem is concerned with how an action taken 
at a particular time step affects the final outcome. For 
example, if a player wins a game of checkers, it may 
be difficult for that player to determine which of his 
many moves were the most important in helping him 
win, and which moves may have actually been detrimental. Many
reinforcement learning algorithms have been derived to assign proper
credit, including Q-learning, Sarsa and TD(λ) [13, 8, 10]. The goal
of these algorithms is to make the learner converge to 
the correct policy, in a speedy manner, or to at least 
make a good tradeoff between correctness and speed. 

This paper poses the single-agent time-extended problem as a
multi-agent single-time-step problem, transforming the temporal credit
assignment problem into a structural credit assignment problem. This
credit assignment problem is then solved with multi-agent utilities,
where credit is assigned through agent-specific utility functions. In
our solution an agent evaluates its role in the outcome of a global
utility function over all agents through a private utility function
that is both “aligned” with the global utility and sensitive to the
agent's actions. In many cases the multi-agent solution is equivalent
to popular reinforcement learners. Showing this equivalence is
beneficial in many ways: i) it allows users to pose many problems
either as a structural or a temporal credit assignment problem, and
choose the one that is best suited for the domain; ii) it highlights
potential pitfalls of some approaches to structural credit assignment
by expressing their problems as well known deficiencies of temporal
credit assignment algorithms; iii) it lays the basis for deriving
principled solutions to time-extended multi-agent systems, where both
credit assignment problems are present.

In this work, Section 2 describes the structural credit assignment
problem and presents a solution in terms of learnable private
utilities that are aligned with a global utility. Section 3 summarizes
relevant issues in the standard temporal credit assignment problem,
where a single agent needs to determine how an action affects the
entire sequence of rewards received after that action. Section 4 then
shows how the two credit assignment problems can be unified by
transforming a single-agent multi-time-step problem into a structural
credit assignment problem, allowing temporal credit assignment
problems to be posed as structural credit assignment problems.
Sections 5, 6 and 7 then show how the new structural credit assignment
problem can be solved using the three utilities presented in Section 2.
The application of two of the utilities illustrates the relationship
between two popular temporal credit assignment methods and the
multi-agent concepts of utility alignment, learnability and system
observability. The application of the third utility shows how a subtle
pitfall common in multi-agent utilities relates to an obvious problem
in multi-time-step systems.

2. Structural Credit Assignment 

Many structural credit assignment problems in multi-agent systems have
been successfully addressed with multi-agent utilities [4, 15, 11, 1,
7, 12]. This section summarizes how to create an effective multi-agent
utility that has two important properties: it is both aligned with a
global utility over all agents, and it is easy for the agents to
learn. These properties will be called “factoredness” and
“learnability” respectively. Results from this section will be used
later to show how a multi-time-step single-agent system can be cast as
a single-time-step multi-agent problem. To avoid confusion from
overloading the word “agent”, we will use the word “agent” exclusively
in the single-agent problem. We will instead use the term “node” in
place of “agent” in the multi-agent extensions we discuss, and in
particular we will call such systems “multi-node systems.”

The goal of the multi-node system is to maximize a world utility
function, $G(z)$, which is a function of the joint move of all nodes
in the system, $z$. Instead of maximizing $G(z)$ directly, each node,
$\eta$, tries to maximize its private utility function $g_\eta(z)$.
The goal of this section is to solve the structural credit assignment
problem in this context, i.e., to create private utility functions
that lead the multi-node system to high values of $G(z)$. Note that in
many systems an individual node $\eta$ will only influence some of the
components of $z$. We will use the notation $z_\eta$ to refer to the
parts of the state that are influenced by the actions of $\eta$. The
vector $z_\eta$ is the same size as $z$ and is equal to $z$ except
that all the components that $\eta$ does not influence are set to
zero. In this notation $z - z_\eta$ expresses the states of all the
nodes other than $\eta$. This notation is used instead of standard
vector decomposition to make the addition and subtraction of vector
components explicit.

2.1. Node Utilities 

While each node is trying to maximize its private utility, as a whole
we want the system to try to maximize the global utility. To do this,
we want the nodes' private utilities to be aligned with the global
utility. We call such an aligned utility a “factored utility.”
Formally, a private utility is factored when for each node $\eta$:

$$g_\eta(z) \geq g_\eta(z') \Leftrightarrow G(z) \geq G(z') \quad \forall z, z' \ \text{s.t.}\ z - z_\eta = z' - z'_\eta .$$

Intuitively, for all pairs of vectors $z$ and $z'$ that differ only
for node $\eta$, a change in $\eta$'s state that increases its
private utility cannot decrease the world utility.
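The factoredness condition can be checked by brute force on a small system. The following is a minimal sketch, not taken from the paper: the world utility `G`, the two candidate private utilities and the helper `is_factored` are all hypothetical, and the check uses the non-strict form of the condition as written above.

```python
from itertools import product

def is_factored(g, G, node, n_nodes, actions):
    """Brute-force check of the factoredness condition for one node.

    For every pair (z, z2) that differs only in `node`'s component,
    g(z) >= g(z2) must hold exactly when G(z) >= G(z2).
    """
    for z in product(actions, repeat=n_nodes):
        for a in actions:
            z2 = z[:node] + (a,) + z[node + 1:]
            if (g(z) >= g(z2)) != (G(z) >= G(z2)):
                return False
    return True

# Hypothetical world utility: count neighbouring nodes that agree.
def G(z):
    return sum(1 for a, b in zip(z, z[1:]) if a == b)

# Candidate private utilities for node 0 (both are assumptions).
def g_global(z):
    return G(z)          # trivially aligned with G

def g_selfish(z):
    return z[0]          # prefers a large own action, ignores neighbours

print(is_factored(g_global, G, 0, 3, (1, 2)))
print(is_factored(g_selfish, G, 0, 3, (1, 2)))
```

Using the global utility itself passes the check, while the selfish utility fails because node 0 can raise it while lowering `G`.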

As a trivial example, any system in which all the private utility
functions equal $G$ is factored [2]. However such systems often suffer
from low signal-to-noise, a problem that gets progressively worse as
the size of the system grows. This is because for large systems where
$G$ sensitively depends on all components of the system, each agent
may experience difficulty discerning the effects of its actions on
$G$. As a consequence, each $\eta$ may have difficulty achieving high
$g_\eta$. We call this signal/noise effect learnability. Intuitively
learnability is the ratio of the sensitivity of the utility to
$\eta$'s actions, to the sensitivity of the utility to the actions of
all other agents. So at a given state $z$, the higher the
learnability, the more $g_\eta(z)$ depends on the move of node $\eta$,
i.e., the better the associated signal-to-noise ratio for $\eta$.

2.2. Fully Observable Difference Utility 

A factored utility that has been shown to be easier 
to learn than the global utility in domains where z is 
fully observable is the difference utility [15], given 
by: 

$$DU_\eta(z) = G(z) - G(z - z_\eta) . \tag{1}$$



This utility is factored because the second term does not depend on
the state of $\eta$, and thus the only way $\eta$ can change the value
of the difference utility is by changing the value of the first term,
which is the global utility. Intuitively, the second term of the
difference utility is the value of the global utility without node
$\eta$. The difference utility then quantifies node $\eta$'s
contribution to the global utility. In addition to being factored, it
can be proven that in many circumstances, especially in large
problems, $DU_\eta$ has higher learnability than does the global
utility [15]. This is mainly due to the second term of $DU_\eta$,
which removes a lot of the effect of other agents (i.e., noise) from
$\eta$'s utility.
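As an illustration of Equation 1, the sketch below computes the difference utility for a toy system. The congestion-style world utility `G`, its `capacity` parameter and the helper `remove_node` are assumptions made for the example, not definitions from the paper; action 0 plays the role of a zeroed-out component.

```python
import math

def remove_node(z, node):
    """Return z - z_eta: the node's component replaced by the zero action."""
    return z[:node] + (0,) + z[node + 1:]

def G(z, capacity=2.0):
    """Hypothetical congestion-style world utility.

    Action k > 0 means "use resource k"; action 0 means "use nothing".
    A resource with x users contributes x * exp(-x / capacity).
    """
    counts = {}
    for a in z:
        if a != 0:
            counts[a] = counts.get(a, 0) + 1
    return sum(x * math.exp(-x / capacity) for x in counts.values())

def DU(z, node):
    """Difference utility of Equation 1: G(z) - G(z - z_eta)."""
    return G(z) - G(remove_node(z, node))

z = (1, 1, 1, 2)
for node in range(len(z)):
    print(node, round(DU(z, node), 4))
```

Because the subtracted term does not depend on the node's own action, any change of that action shifts `DU` and `G` by the same amount, which is the property the factoredness argument relies on.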

The fully observable difference utility has been successfully applied
to various domains, including packet routing over a data network [15],
congestion games [16], data downloads from a constellation of
satellites [14], and multi-agent gridworlds [11].

2.3. Partially Observable Difference Utility 

In many cases, it may be impossible to compute $DU_\eta(z)$ (or
$G(z)$) because some of the component values of $z$ are unknown to
node $\eta$. We will denote the component of $z$ that is known by
$\eta$ using the vector $z^{o_\eta}$ and the part of $z$ that is
unknown to $\eta$ using the vector $z^{h_\eta}$. The vector
$z^{o_\eta}$ is the same as $z$ except that all the elements that are
unknown to $\eta$ are set to zero. We call the known components the
observable components of the worldline. The vector $z^{h_\eta}$
conversely contains all of the values that are not observable. The
vector $z$ is the sum of these two vectors: $z = z^{o_\eta} +
z^{h_\eta}$. If $z$ does not equal $z^{o_\eta}$, then node $\eta$ may
not be able to compute $DU_\eta(z)$ directly. Instead we must
approximate it using the information in $z^{o_\eta}$. One way to do
this is to simply use $z^{o_\eta}$ as the parameter to the difference
utility. We will call this utility the truncated difference utility
(TDU) since the non-observable components are essentially truncated
out (footnote 1). The TDU is given by:

$$TDU_\eta(z) \equiv DU_\eta(z^{o_\eta}) = G(z^{o_\eta}) - G(z^{o_\eta} - z_\eta) . \tag{2}$$

This utility has been shown to be highly learnable, since the second
term removes much of the noise caused by the actions of other agents
[1]. In fact $TDU_\eta$ in general is even more learnable than
$DU_\eta$ since $z^{o_\eta}$ is likely to contain information
pertinent to node $\eta$ whereas $z$ may contain a lot of irrelevant
information. The main problem with this utility however is that it is
not factored with respect to $G(z)$. This utility is only factored
insofar as $G(z^{o_\eta})$ approximates $G(z)$. This means that a node
could take actions that improve the value of its $TDU_\eta$, yet
reduce the value of the global utility.

An alternative to truncating out the non-observable components is to
estimate the value of the difference utility given the observable
components. We will call this utility the estimated difference utility
(EDU), which is given by:

\begin{align}
EDU_\eta(z) &= E[DU_\eta(z) \mid z^{o_\eta}] \\
&= E[G(z) \mid z^{o_\eta}] - E[G(z - z_\eta) \mid z^{o_\eta}] , \tag{3}
\end{align}

where $E[\cdot \mid z^{o_\eta}]$ is the expected value over
non-observed states. While this utility is also not factored with
respect to $G(z)$, in general $E[G(z) \mid z^{o_\eta}]$ is closer to
$G(z)$ than $G(z^{o_\eta})$ is, and hence is more likely to be
factored than TDU [1]. Any action that an agent takes to increase the
value of the EDU must increase the value of $E[G(z) \mid z^{o_\eta}]$,
since its actions are removed from the second term of the EDU. When a
good estimate is used, an action that increases the value of
$E[G(z) \mid z^{o_\eta}]$ is very likely to increase the value of the
global utility, $G(z)$. Similarly an action that increases the value
of TDU must also increase the first term of the TDU, $G(z^{o_\eta})$.
However when many of the components of $z$ are not observable, there
are many possible actions that will increase $G(z^{o_\eta})$, but will
not increase $G(z)$, due to interactions with non-observable
components.
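The contrast between Equations 2 and 3 can be sketched on a toy system. Everything specific here is an assumption for illustration: the world utility `G`, the action set `ACTIONS`, and in particular the choice to model the expectation over hidden components as a uniform, independent average over `ACTIONS`. The node being evaluated is assumed to be observable.

```python
import math
from collections import Counter
from itertools import product

ACTIONS = (1, 2)  # hypothetical action set; 0 marks a zeroed-out component

def G(z):
    """Hypothetical world utility: x users of a resource yield x*exp(-x/2)."""
    counts = Counter(a for a in z if a != 0)
    return sum(x * math.exp(-x / 2.0) for x in counts.values())

def zero_out(z, indices):
    """Set the components listed in `indices` to zero."""
    return tuple(0 if i in indices else a for i, a in enumerate(z))

def TDU(z, node, hidden):
    """Equation 2: difference utility evaluated on the observable part only."""
    z_obs = zero_out(z, hidden)
    return G(z_obs) - G(zero_out(z_obs, {node}))

def EDU(z, node, hidden):
    """Equation 3, with hidden components assumed uniform over ACTIONS."""
    hidden = sorted(hidden)
    total = 0.0
    for fill in product(ACTIONS, repeat=len(hidden)):
        z_full = list(z)
        for i, a in zip(hidden, fill):
            z_full[i] = a          # one possible completion of the hidden part
        z_full = tuple(z_full)
        total += G(z_full) - G(zero_out(z_full, {node}))
    return total / len(ACTIONS) ** len(hidden)

z = (1, 1, 2, 2)
print(TDU(z, 0, {2, 3}), EDU(z, 0, {2, 3}))
```

When nothing is hidden, both functions reduce to the fully observable difference utility; they diverge as more components become unobservable.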

3. Temporal Credit Assignment Problem

In a typical temporal credit assignment problem, an agent takes a
sequence of actions, transitions through a sequence of states, and
receives a sequence of rewards. The global utility for such a system
in the episodic case is the undiscounted sum of rewards:

$$G(z) = \sum_{t} R_t(z) . \tag{4}$$

We use the undiscounted version for simplicity, since this paper uses
an episodic “finite-horizon” model of learning where discounting is
not needed. When learning is not episodic, discounting must be used to
avoid infinite sums. In such systems the utility at a time step is the
discounted sum of future rewards, $G(z) = \sum_t \gamma^t R_t(z)$,
where $\gamma$ is the discount factor in the range [0, 1]. We will
focus on problems where an agent chooses its actions based on the
estimates of future rewards stored in a “Q-table.” This Q-table is
indexed by all the possible states and actions, where the value
$Q(s, a)$ is the estimate of the sum of future rewards when action $a$
is taken in state $s$. The credit assignment problem in this case
consists of determining how an action at time step $t$, $a_t$, affects
all of the rewards after time step $t$.

1. This utility is called “TTU” in [1].

Now let us summarize versions of two reinforcement learning methods
that address this problem for simple deterministic domains:
first-visit Monte-carlo estimation and Q-learning [9]. With
Monte-carlo estimation, an action is given credit for all the
subsequent rewards. Therefore the Q-table estimate of the future
rewards resulting from action $a_t$ in state $s_t$ is based on the
rewards obtained after the action was taken. In deterministic
Monte-carlo estimation, the Q-table value estimating the undiscounted
sum of future rewards after action $a_t$ is taken in state $s_t$ is:

$$Q_{mc}(s_t, a_t) = \sum_{t' \geq t} R_{t'} , \tag{5}$$

where $Q_{mc}$ is the Q-table for the Monte-carlo estimator.
Monte-carlo estimation works best when the values of the future
rewards obtained after an action are very dependent on that action.
However, in some domains many of the values of rewards received after
time $t$ may not be dependent on the action $a_t$. In such cases the
Q-table estimate for action $a_t$ in state $s_t$ may contain a lot of
noise since it includes reward values that are primarily a function of
future actions. If the future actions change, the value of action
$a_t$ in state $s_t$ may be very different. In essence, the temporal
credit assignment problem is only partially solved by this method.
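The deterministic first-visit Monte-carlo estimate can be sketched as follows. This is an illustrative implementation under assumptions: the episode is represented as a list of hypothetical `(state, action, reward)` triples, and the reward at time $t$ is taken to be the one received for the transition chosen at time $t$, so it is included in the sum.

```python
def monte_carlo_update(q_mc, episode):
    """First-visit, undiscounted, deterministic Monte-carlo update.

    `episode` is a list of (state, action, reward) triples in time order.
    Each (state, action) pair is set to the sum of rewards received from
    its first occurrence to the end of the episode.
    """
    # Reward-to-go for every time step, accumulated from the end.
    returns = []
    running = 0.0
    for _, _, reward in reversed(episode):
        running += reward
        returns.append(running)
    returns.reverse()

    seen = set()
    for (state, action, _), ret in zip(episode, returns):
        if (state, action) not in seen:   # first visit only
            seen.add((state, action))
            q_mc[(state, action)] = ret
    return q_mc

episode = [("s2", "right", 1.0), ("s3", "right", 5.0), ("s4", "left", -2.0)]
q = monte_carlo_update({}, episode)
print(q)
```

Note that the entry for the first action absorbs the rewards of both later actions, which is the source of the noise described above.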

In contrast to Monte-carlo estimation, a Q-learner only gives full
credit to the immediate reward, and instead uses other Q-values to
estimate the values of future rewards. In Q-learning the Q-table value
for action $a_t$ taken in state $s_t$ is:

$$Q_{QL}(s_t, a_t) = R_t + \max_a Q_{QL}(s_{t+1}, a) , \tag{6}$$

where $Q_{QL}$ is the Q-table used by the Q-learner. Agents using
Q-learning can often learn more quickly than agents using Monte-carlo
learning when an agent's action has much more influence on its
immediate reward than on future rewards. In addition the Q-tables can
be updated after every action with Q-learning, without waiting for the
end of the episode. Note that for simplicity we use the deterministic
form for both learning methods, because their update rules are cleaner
in deterministic problems. However the conclusions of this paper do
not depend on this determinism. In the next section we will show how
these two reinforcement learning methods can be seen as forms of
difference utilities.
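The deterministic update of Equation 6 can be sketched in a few lines. The dictionary-based Q-table, the default value of zero for unseen entries and the state and action names are assumptions made for this example.

```python
def q_learning_update(q_ql, state, action, reward, next_state, actions):
    """Deterministic, undiscounted Q-learning update (Equation 6).

    Unseen (state, action) entries are assumed to have value 0.0.
    """
    best_next = max(q_ql.get((next_state, a), 0.0) for a in actions)
    q_ql[(state, action)] = reward + best_next
    return q_ql[(state, action)]

ACTIONS = ("left", "right")
q = {}
# Updates can be applied after every single transition.
q_learning_update(q, "s3", "right", 5.0, "s4", ACTIONS)
q_learning_update(q, "s2", "right", 1.0, "s3", ACTIONS)
print(q)
```

Unlike the Monte-carlo estimate, each update here needs only the immediate reward and the current Q-values of the next state.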

4. Unified View of Credit Assignment 

In this section we show how any Markov Decision Process (MDP) for a
single-agent system can be posed as a single-time-step multi-node
problem. In a single-agent MDP system there is a set of states and
transitions between the states. The agent starts in a start-state and
then transitions through the state-transition diagram, receiving
rewards depending on the transitions taken. In some cases the agent
will continue making transitions until it reaches an absorbing state.
This paper instead uses a “finite horizon” model, where an agent moves
for a fixed number of time steps after starting in its start-state.
One of the most important properties of an MDP is that it is
“memoryless” in that the expected value of future rewards that will be
received after an agent enters a state is independent of the set of
states that the agent was in before. The best transition that an agent
can take can therefore be determined solely from its current state.
Figure 1 (top) shows a simple MDP with only four states and two
transitions per state.

This section turns the MDP into a multi-node problem by first
assigning a node to every state of the MDP. The notation $S(\eta)$ is
used to indicate the state for a node $\eta$. An action for a node
$\eta$ corresponds to a transition out of state $S(\eta)$. The nodes'
actions determine the transitions that the agent will take from its
state. Note that the agent now simply follows the “actions” of the
nodes and performs no learning of its own. All the learning takes
place in the nodes. The actions of all of the nodes, $z$, therefore
define the path an agent will take through the state-transition
diagram of the MDP, given the agent's start-state. The sum of rewards
received on this path defines the global utility, $G(z)$. These nodes
have a simple single-time-step learning task of mapping their
immediate private utility values to their actions. This mapping can be
stored in a simple single-state Q-table over actions for each node. In
a deterministic domain, the update rule for this Q-table after taking
action $a$ is simply:

$$Q_\eta(a) = g_\eta , \tag{7}$$

where $g_\eta$ is the utility the node is trying to maximize. Note
that this is a much smaller Q-table than the ones used in Q-learning
and Monte-carlo estimation, since it is only for one state.

Figure 1. Four State MDP. (Top) Agent starts in State 2 and makes
transitions for two time steps, receiving two rewards. (Middle) In the
multi-node version, a node is assigned to each state. A node's action
is a choice of a transition. The choice is encoded as either a 1
(right-transition) or a 2 (left-transition) in the action vector, $z$.
(Bottom) The action vector $z - z_\eta$ has a zero in its third
element. This corresponds to a “zero action” transitioning the agent
to an absorbing state.

An example multi-node system is shown in Figure 1 (middle) where each
of the four states in the MDP corresponds to one of four nodes. The
action vector $z = [1\ 1\ 1\ 2]^T$ encodes a set of four actions, one
per node. For this action vector, the agent in the MDP transitions
right twice, receiving rewards of $R_3$ and $R_4$. The global utility
for $z$, $G(z)$, is therefore $R_3 + R_4$. Computing the global
utility for the action vector $z - z_\eta$ requires more definitions.
Consider the case where we want to compute the utility $G(z - z_\eta)$
for node 3 in the example, where $z - z_\eta = [1\ 1\ 0\ 2]^T$. The
third element of this action vector is zero, which does not correspond
to any transition. The transition the agent would take out of the
third state is undefined. We will therefore define this “zero” action
to correspond to a “null” transition, which will always return a
reward of zero and will transition into an absorbing state as shown in
Figure 1 (bottom). All rewards after this transition is taken will
have a value of zero. The value of $G(z - z_\eta)$ is therefore equal
to $R_3$, since the agent takes the right-action from state 2,
receiving a reward of $R_3$, and then takes the null transition from
state 3. Note that other definitions can be made for the zero-action,
and depending on the encoding, the zero-action may refer to an actual
transition. However the results shown in this paper are based on a
transition to an absorbing state, and assume that actions are encoded
so that an action encoded as a zero never refers to an actual
transition in the MDP.
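The four-state example and its null action can be sketched as follows. The figure itself is not reproduced in the text, so the layout is an assumption: the states are taken to form a chain, the reward $R_k$ is taken to be received on entering state $k$, movement is clipped at the ends, and the numeric reward values are hypothetical.

```python
REWARD = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}  # hypothetical values for R_1..R_4
N_STATES = 4

def global_utility(z, start=2, horizon=2):
    """Sum of rewards along the path defined by the action vector z.

    z[s - 1] is the action of the node for state s:
    1 = move right, 2 = move left, 0 = null transition (absorb, reward 0).
    """
    state, total = start, 0.0
    for _ in range(horizon):
        action = z[state - 1]
        if action == 0:
            break  # absorbing state: all remaining rewards are zero
        step = 1 if action == 1 else -1
        state = min(max(state + step, 1), N_STATES)  # assumed: ends clip movement
        total += REWARD[state]
    return total

def remove_node(z, state):
    """Return z - z_eta for the node assigned to `state`."""
    return tuple(0 if i == state - 1 else a for i, a in enumerate(z))

z = (1, 1, 1, 2)
print(global_utility(z))                  # R_3 + R_4
print(global_utility(remove_node(z, 3)))  # R_3
```

With these assumed rewards, zeroing node 3's action cuts the path after the first transition, leaving only $R_3$.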

Let us now describe the learning that takes place in this MDP. This
paper uses an episodic model of learning where the agent starts at a
start-state at the beginning of an episode and moves according to the
transitions available to the MDP. If the agent is using Monte-carlo
estimation or Q-learning, it makes the decisions about which
transition to take from a state based on the Q-table values associated
with that state. An agent using Monte-carlo estimation will update the
Q-table values at the end of the episode, based on the reward values
received during the episode. An agent using Q-learning will update the
Q-table values during the course of an episode. In the multi-node
version of the MDP, each node will perform a single non-null action at
the beginning of the episode. This action will be determined from the
small Q-table used by each node, which contains estimated values only
for the node's state. Each node will then update its Q-table using its
private utility, either at the end of the episode or during the
episode, depending on the private utility used.

As an example take the gridworld problem (Figure 2). In this classic
problem, the agent can move from grid square to grid square, until it
reaches a terminal state. The agent can move in four directions, and
the state is determined by which grid square the agent is in. This
problem can be broken down into nodes, where each grid square is
assigned a node. At the beginning of the episode each node
independently chooses an action from one of four possible moves. The
MDP agent then follows the actions chosen by the nodes, until it
reaches the terminal square. All the nodes associated with grid
squares that the gridworld agent actually went through are updated at
the end of the episode.

If one of the nodes' actions is replaced with the null action, then
the gridworld agent will not progress beyond that node's state, as
shown in Figure 3. In this example, the node associated with the third
state entered has its action changed from the move-right action to the
null action, corresponding to its component in the action vector
$z - z_\eta$. Whereas the reward summation for $G(z)$ sums eight
rewards, the reward summation for $G(z - z_\eta)$ will include only
the first two rewards, since the rewards received after the second
time step have a value of zero. In a sense $G(z - z_\eta)$ gives the
quality of the path up to node $\eta$. Therefore the difference of
these two utilities, $G(z) - G(z - z_\eta)$, gives the quality of the
path past node $\eta$. Since anything that happened before the agent
entered node $\eta$ does not affect the value of
$G(z) - G(z - z_\eta)$, that utility can be interpreted as the
contribution of node $\eta$ to the full path.

Figure 2. Classic Gridworld Problem broken down into multiple nodes.
Each grid square is assigned a node, which chooses an action at the
beginning of each episode. The agent then follows these actions. Nodes
that were visited by the agent are updated (black arrows).

Figure 3. When $z$ is changed to $z - z_\eta$ the action of node
$\eta$ (third square right of agent's starting place) is changed from
the move-right action to the “null” action. With this change, the
gridworld agent no longer moves along the dotted line and instead
moves into an absorbing state, receiving rewards of zero.


5. DU and Monte-Carlo Estimation 

If all of the rewards of an episode of $N$ time steps are known when
the private utility for a node is computed, the DU (Equation 1) can be
used as the node's private utility:

\begin{align}
DU_\eta(z) &= G(z) - G(z - z_\eta) \tag{8} \\
&= \sum_{t=1}^{N} R_t(z) - \sum_{t=1}^{N} R_t(z - z_{T(\eta)}) \tag{9} \\
&= \sum_{t=1}^{N} R_t(z) - \sum_{t=1}^{T(\eta)-1} R_t(z - z_{T(\eta)}) - \sum_{t=T(\eta)}^{N} R_t(z - z_{T(\eta)}) , \tag{10}
\end{align}

where $T(\eta)$ is the first time the agent entered state $S(\eta)$.
For times before $T(\eta)$, the action taken by node $\eta$ is
irrelevant so $R_t(z - z_{T(\eta)})$ equals $R_t(z)$ for all
$t < T(\eta)$. Therefore we can rewrite DU as:

\begin{align}
DU_\eta(z) &= \sum_{t=1}^{N} R_t(z) - \sum_{t=1}^{T(\eta)-1} R_t(z) \tag{11} \\
&\quad - \sum_{t=T(\eta)}^{N} R_t(z - z_{T(\eta)}) \tag{12} \\
&= \sum_{t=T(\eta)}^{N} \left( R_t(z) - R_t(z - z_{T(\eta)}) \right) . \tag{13}
\end{align}


In addition all rewards $R_t(z - z_{T(\eta)})$ past time $T(\eta)$ are
zero because the action at time $T(\eta)$ is now the null-action.
Therefore we can simply write DU as:

$$DU_\eta(z) = \sum_{t=T(\eta)}^{N} R_t(z) . \tag{14}$$

The difference utility for node $\eta$'s action is therefore the same
as the undiscounted Monte-carlo estimate that would be received in a
single-agent system for taking an action in state $S(\eta)$.
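The equality between the difference utility and the first-visit Monte-carlo return can be checked numerically. The sketch below assumes a hypothetical one-dimensional corridor in which actions are steps of +1 or -1 and the reward is received on entering a state; `None` stands for both the null action and the absorbing state.

```python
REWARD_ON_ENTRY = [0.0, 1.0, -2.0, 5.0, 0.0]  # hypothetical corridor rewards

def step(state, action):
    """Move by `action` (+1 or -1), clipped at the corridor ends."""
    nxt = min(max(state + action, 0), len(REWARD_ON_ENTRY) - 1)
    return nxt, REWARD_ON_ENTRY[nxt]

def rollout(policy, start, horizon):
    """Return [(state, reward)] per time step; None is the null action."""
    trace, state = [], start
    for _ in range(horizon):
        if state is None or policy[state] is None:
            trace.append((state, 0.0))  # null transition or absorbed
            state = None
        else:
            nxt, r = step(state, policy[state])
            trace.append((state, r))
            state = nxt
    return trace

def G(policy, start, horizon):
    return sum(r for _, r in rollout(policy, start, horizon))

def DU(policy, node, start, horizon):
    """G(z) - G(z - z_eta), with the node's action replaced by the null action."""
    without = dict(policy)
    without[node] = None
    return G(policy, start, horizon) - G(without, start, horizon)

def first_visit_return(policy, node, start, horizon):
    """Sum of rewards from the first time the agent acts from `node`."""
    trace = rollout(policy, start, horizon)
    for t, (s, _) in enumerate(trace):
        if s == node:
            return sum(r for _, r in trace[t:])
    return 0.0  # node never visited

policy = {0: 1, 1: 1, 2: 1, 3: -1, 4: -1}
for node in policy:
    print(node, DU(policy, node, 0, 6), first_visit_return(policy, node, 0, 6))
```

The two columns agree for every node, including nodes that are revisited and nodes that are never visited (for which both values are zero).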

6. TDU and Immediate Rewards 

Using the difference utility in this problem requires a node to know
all of the future actions of an episode. This is an issue with
Monte-carlo estimation, where an episode has to be completed before
learning is performed. However in many reinforcement learning domains
we want learning to be performed immediately, even before the future
actions are taken. If the future actions of an episode are unknown, we
can use the TDU instead of the DU by including only the current and
previous actions in the observable components of $z$. The TDU for node
$\eta$ can be computed as follows:

\begin{align}
TDU_\eta(z) &= DU_\eta(z^{o_\eta}) \tag{15} \\
&= \sum_{t=T(\eta)}^{N} R_t(z^{o_\eta}) \tag{16} \\
&= R_{T(\eta)}(z^{o_\eta}) + \sum_{t=T(\eta)+1}^{N} R_t(z^{o_\eta}) . \tag{17}
\end{align}



Since the future actions are unknown, all actions past time $T(\eta)$
are not observable. Therefore the actions in $z^{o_\eta}$
corresponding to actions that occur past time $T(\eta)$ are
null-actions from the definition of $z^{o_\eta}$. Rewards that are a
function of $z^{o_\eta}$ therefore have a value of zero past time
$T(\eta)$. The TDU therefore simply reduces to the immediate reward:

$$TDU_\eta(z) = R_{T(\eta)}(z^{o_\eta}) . \tag{18}$$

This utility is unsatisfactory since to properly evaluate the quality
of an action, a node needs to see the consequences of that action on
future rewards. It is equivalent to Monte-carlo estimation with
infinite discounting ($\gamma = 0$), which is clearly unacceptable in
most domains. Note that the unacceptability of this utility is not
obvious from its definition in Section 2. In fact even though this
utility is not factored, it seemed promising due to its high
learnability. The use of non-factored utilities is a potential problem
in many multi-agent system applications. Unfortunately the downside of
a non-factored utility is not as obvious as the downside of infinite
discounting. The results in this section show that they are
equivalent, and that non-factored utilities have to be carefully
evaluated in multi-agent domains, since the complexity of the
multi-agent system may hide their danger.
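The collapse of the TDU to the immediate reward can be sketched numerically. The setting is an assumption chosen to keep the example short: a hypothetical chain in which each node's action is a forward jump length (so no state is revisited), the reward is received on entering a state, and action 0 is the null action. The observable vector keeps only the actions taken up to and including time $T(\eta)$.

```python
def path_rewards(z, reward_on_entry, start, horizon):
    """Rewards per time step and the states the agent acted from.

    z[s] is the forward jump chosen by the node for state s; 0 is the
    null action, after which every reward is zero.
    """
    state, rewards, visited = start, [], []
    for _ in range(horizon):
        if state is None or z[state] == 0:
            rewards.append(0.0)
            state = None
        else:
            visited.append(state)
            state += z[state]
            rewards.append(reward_on_entry[state])
    return rewards, visited

def TDU(z, node, reward_on_entry, start, horizon):
    """Equation 15: DU evaluated on the observable actions only."""
    _, visited = path_rewards(z, reward_on_entry, start, horizon)
    t_eta = visited.index(node)              # T(eta), first visit time
    known = set(visited[:t_eta + 1])         # current and previous actions
    z_obs = tuple(a if s in known else 0 for s, a in enumerate(z))
    z_obs_minus = tuple(0 if s == node else a for s, a in enumerate(z_obs))
    g_obs = sum(path_rewards(z_obs, reward_on_entry, start, horizon)[0])
    g_minus = sum(path_rewards(z_obs_minus, reward_on_entry, start, horizon)[0])
    return g_obs - g_minus

z = (1, 2, 1, 2, 1, 1, 1, 1)                 # hypothetical jump lengths
rewards_on_entry = (0, 1, 4, 2, 8, 3, 5, 7)  # hypothetical rewards
rewards, visited = path_rewards(z, rewards_on_entry, 0, 3)
for t, node in enumerate(visited):
    print(node, TDU(z, node, rewards_on_entry, 0, 3), rewards[t])
```

For every visited node the TDU equals the reward received at that node's own time step, regardless of what follows.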

7. EDU and Q-learning 

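Instead of the TDU, the EDU can be used as the utility for node $\eta$
when the future actions in an episode are unknown. In this system the
EDU is computed as follows:

\begin{align}
EDU_\eta(z) &= E[DU_\eta(z) \mid z^{o_\eta}] \\
&= E\left[ \sum_{t=T(\eta)}^{N} R_t(z) \,\middle|\, z^{o_\eta} \right] \\
&= E[R_{T(\eta)}(z) \mid z^{o_\eta}] + E\left[ \sum_{t=T(\eta)+1}^{N} R_t(z) \,\middle|\, z^{o_\eta} \right] .
\end{align}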

Since the reward at time $T(\eta)$ is completely determined from the
observable components, $E[R_{T(\eta)}(z) \mid z^{o_\eta}]$ equals
$R_{T(\eta)}(z)$, leaving us with the problem of estimating the sum of
future rewards for an action. This estimate can be made by keeping a
record of all of the rewards received for all the episodes. Using the
rewards from previous episodes, a node $\eta$ can estimate
$\sum_{t=T(\eta)+1}^{N} R_t(z)$ by looking at the sequence of rewards
received the last time an agent went through state $s_{T(\eta)+1}$.
However recording all of these rewards is unnecessary since the
relevant rewards are summarized in the Q-tables of other nodes.
Assuming the state entered after taking an action at time $T(\eta)$ is
known, this estimate of future rewards can be obtained from the
Q-table values of the node used in the next state, and we can write
EDU as follows:

$$EDU_\eta(z) = R_{T(\eta)}(z^{o_\eta}) + E[Q_{\eta'}(z_{T(\eta)+1}) \mid z^{o_\eta}] , \tag{19}$$

where $z_{T(\eta)+1}$ is the next action after time $T(\eta)$ and
$\eta'$ is the node corresponding to the state entered at time
$T(\eta) + 1$. Since $z_{T(\eta)+1}$ is not observable, the estimation
of $Q_{\eta'}(z_{T(\eta)+1})$ will depend on the exploration method
used for each node. However for most exploration methods, as the rate
of exploration approaches zero, the action corresponding to the
highest Q-value will always be used, resulting in the following
computation of EDU:

$$EDU_\eta(z) = R_{T(\eta)}(z^{o_\eta}) + \max_a Q_{\eta'}(a) , \tag{20}$$

where $a$ is a possible action from state $S(\eta')$. In this
situation the EDU for node $\eta$ therefore provides the Q-learning
estimate for action $z_\eta$ in state $S(\eta)$. Note that we started
out trying to make the “on-policy” estimate of what the sum of future
rewards actually will be given the actions taken. However since none
of the future actions were known we ended up with the “off-policy”
Q-learning estimate instead. The on-policy Sarsa type estimate cannot
be used in this situation because the action taken after time
$T(\eta)$ is not observable.
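Equation 20 combined with the single-state Q-tables of Equation 7 can be sketched as follows. The chain MDP, its rewards and the terminal state are hypothetical, and for reproducibility the sketch sweeps over every node and action instead of letting an exploring agent generate the updates, which is a simplification of the episodic process described in the text.

```python
TERMINAL = 3  # hypothetical terminal state; it has no node and no Q-table

def step(state, action):
    """Hypothetical chain: advance by 1 (cost 1) or by 2 (cost 3);
    entering the terminal state pays a bonus of 10."""
    nxt = min(state + action, TERMINAL)
    reward = (-1.0 if action == 1 else -3.0) + (10.0 if nxt == TERMINAL else 0.0)
    return nxt, reward

def edu_update(node_q, state, action):
    """Set Q_eta(a) to the EDU of Equation 20: the immediate reward plus
    the best value in the single-state Q-table of the next node."""
    nxt, reward = step(state, action)
    next_table = node_q.get(nxt, {})          # empty for the terminal state
    best_next = max(next_table.values(), default=0.0)
    node_q[state][action] = reward + best_next
    return node_q[state][action]

# One single-state Q-table per node, as in Equation 7.
node_q = {s: {1: 0.0, 2: 0.0} for s in range(TERMINAL)}
for _ in range(5):                            # repeated sweeps until stable
    for s in range(TERMINAL):
        for a in (1, 2):
            edu_update(node_q, s, a)
print(node_q)
```

Each node stores values only for its own state, yet the tables jointly hold the same numbers a conventional Q-learner would store in one table indexed by state and action.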

8. Discussion 

This paper unifies the structural credit assignment problem present in
single-time-step multi-agent systems and the temporal credit
assignment problem present in time-extended single-agent systems. It
does this by relating the three utilities DU, EDU and TDU to three
different reinforcement learning methods: Monte-carlo estimation,
Q-learning and immediate reward learning, respectively. In each case
the relation between the utility and the reinforcement learner is made
through a specific construction of a multi-node system. The structural
credit assignment view of temporal credit assignment highlights the
salient properties of these methods along with their deficiencies.
This view shows how the use of non-factored utilities commonly used in
multi-agent systems is equivalent to a multi-time-step utility
generally not considered in that field. In its place this paper shows
how factored and close-to-factored utilities relate to successful
reinforcement learning methods.

This paper has discussed only single-time-step multi-agent problems
and multi-time-step single-agent problems. The multi-time-step
multi-agent problem is more difficult, since both temporal and
structural credit assignment problems have to be dealt with at once.
However using the method described in Section 4 it is possible to view
a multi-time-step multi-agent problem as only a structural credit
assignment problem.

For example take the gridworld problem shown in Figure 2. A
multi-agent version of this problem can be defined by allowing many
agents to move on the same grid. The reward for a time step in the
multi-agent gridworld would then be a function of all the grid-squares
the agents occupy at that time step. This can be turned into a
multi-node single-time-step problem by assigning a node to each
agent/grid-square pair. So if there are $n$ grid-squares and $m$
agents there would be $nm$ nodes in the multi-node formulation. The
difference utility can then be derived for each node in the same way
it is done in Section 5, up to Equation 13:

$$DU_\eta(z) = \sum_{t=T(\eta)}^{N} \left( R_t(z) - R_t(z - z_{T(\eta)}) \right) .$$

The key difference in the multi-agent version is how we define the
effects of the null-action for an agent. In Section 4 it was defined
so that the rewards $R_t(z - z_{T(\eta)})$ had a value of zero past
time $T(\eta)$. This made sense in a single-agent system where we are
summing rewards for a single agent. In the multi-agent version we have
a double summation over time and agents ($\sum_t \sum_\eta
R_{t,\eta}$) and this sum will not be zero if a single agent is taken
out. Instead any reduction in the term $R_t(z - z_{T(\eta)})$ will be
system dependent. Our current research focuses on exploiting this
coupling to allow faster convergence and better performance in
multi-time-step multi-agent systems.